CorpusExplorer: Supporting a Deeper Understanding of Linguistic Corpora

Authors

  • Andrés Esteban
  • Roberto Therón
Abstract

Word trees are a common way of representing frequency information obtained by analyzing natural language data. This article explores their usage and possibilities, and addresses the development of an application to visualize the relative frequencies of 2-grams and 3-grams in Google's "English One Million" corpus using a two-sided word tree and sparklines to show usage trends through time. It also discusses how the raw data was processed and trimmed to speed up access to it.
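The abstract does not give the processing details, so the following is only a minimal sketch of how per-year relative frequencies for n-grams might be computed and rare n-grams trimmed. It assumes tab-separated rows whose first three fields are the n-gram, the year, and the match count; the function names, the `totals_by_year` mapping, and the `min_total` threshold are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def parse_rows(lines):
    """Yield (ngram, year, count) from tab-separated rows.

    Assumed layout: ngram, year, match count, then any extra columns,
    which are ignored. Rows with fewer than three fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        yield fields[0], int(fields[1]), int(fields[2])


def relative_frequencies(rows, totals_by_year, min_total=0):
    """Return {ngram: {year: count / total tokens in that year}}.

    N-grams whose summed count over all years is below min_total are
    dropped (a hypothetical trimming rule, not the authors' own).
    Years missing from totals_by_year are left out of the series.
    """
    counts = defaultdict(dict)
    for ngram, year, count in rows:
        counts[ngram][year] = counts[ngram].get(year, 0) + count

    series = {}
    for ngram, by_year in counts.items():
        if sum(by_year.values()) < min_total:
            continue
        series[ngram] = {
            year: count / totals_by_year[year]
            for year, count in sorted(by_year.items())
            if totals_by_year.get(year)
        }
    return series


# Made-up sample rows and yearly totals, for illustration only.
sample = [
    "the cat\t1990\t30\t25\t10",
    "the cat\t1991\t50\t40\t12",
    "a rare\t1990\t1\t1\t1",
]
totals = {1990: 1000, 1991: 2000}
series = relative_frequencies(parse_rows(sample), totals, min_total=5)
```

Each per-n-gram series of yearly values is the kind of data a sparkline could plot, and trimming low-count n-grams up front is one plausible way to reduce the volume that has to be looked up at display time.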

Similar resources

An Open Linguistic Infrastructure for Annotated Corpora

Annotated corpora are a fundamental resource for research and development in the field of natural language processing (NLP). Although unannotated corpora (for example, Gigaword, Wikipedia, etc.) are often used to build language models, annotations for linguistic phenomena provide a richer set of features and hence, potentially better models in the long run. It is widely accepted that a first st...

A Cross-linguistic and Cross-cultural Study of Epistemic Modality Markers in Linguistics Research Articles

Epistemic modality devices are believed to be one of the prominent characteristics of research articles, the genre most commonly used by members of the academic community. Considering the importance of such devices in producing and comprehending scientific discourse, this study aimed to investigate, cross-culturally and cross-linguistically, epistemic modality markers as an important subcategory...

Conceptualizing Sensory Relativism in Light of Emotioncy: A Movement beyond Linguistic Relativism

Given the significance of relativism in molding our worldview and uncovering the nature of truth, this study, using the newly developed concept of emotioncy, attempted to introduce sensory relativism as a new perspective on which senses can relativize our understanding of the world. To espouse the theory, 24 individuals were interviewed on their experiences...

Towards deeper understanding of the latent semantic analysis performance

The paper studies the factors influencing the performance of Latent Semantic Analysis. Unlike previous related research, which concentrates on parameters such as matrix element weighting, space dimensionality, and similarity measure, we address the impact of another fundamental factor: the definition of "word". For this purpose, a series of experiments was performed on two corpora in order to...

Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank

Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...

Publication date: 2011